Implement resource-management experimental feature #14466
Conversation
Disclaimer: these comments are just things that came to mind as I read the PR as an interested onlooker.
For example, this builder can provide exclusive access to two GPUs and 128G of memory for remote builds:
builders = ssh://gpu-node x86_64-linux - 32 1 gpu:2,mem:128
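To sketch the consuming side of that example (the attribute values here are hypothetical, using the notation quoted above), a derivation that needs one of those GPUs and part of the memory would declare

    requiredSystemFeatures = [ "gpu:1" "mem:64" ];

With the builder above advertising gpu:2,mem:128, two such derivations could build there at once before a third has to wait for resources to be released.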
Just to cement my understanding of this entire feature:
This line is set in the configuration of the machine-that-wants-to-dispatch-to-builders, and the values are arbitrary (e.g. foo:2 would let two requiredSystemFeatures = ["foo:1"]; derivations be scheduled on it before further ones have to wait for space)? What does foo:0 do?
Recently, some of the more complex settings have been getting implemented as a JSON string (e.g. the external builders setting: #14145). I wonder if, instead of making this part of the {required,supported}SystemFeatures setting, it should be another field in the builders setting (or even a separate setting altogether that references a builder by hostname or something... idk) that accepts JSON, allowing for slightly more robust filtering (e.g. "I want a machine with more than 8GiB of memory, but less than 32GiB" or "I want a machine that has both an AMD GPU and an Nvidia GPU in it").
Not to sign you up for the much more involved work that would entail, but I'm wondering if you've considered how more complex scheduling could be tackled as well (aside from rewriting the scheduler in its entirety...).
What does foo:0 do?
It means the same as just foo (unlimited in the case of a machine's feature, and not consuming anything in the case of requiredSystemFeatures). Do you think it's worth clarifying in the documentation, since it's not something a user would likely ever use?
wondering if you've considered how more complex scheduling could be tackled as well
I hadn't, but I think the existing implementation can be used to at least satisfy the examples you've given. A machine with >8G but <32G could be selected by adding a small supported or mandatory feature label to the machine in question... in practice you'll probably want to filter by reserving large-memory machines for memory-heavy derivations, so that small-memory derivations don't effectively crowd the larger derivations out from landing at all; for that you could use a heavy mandatory feature. A machine that has both an AMD GPU and an Nvidia GPU in it could be requested with simply ["amdgpu:1" "nvidiagpu:1"].
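As a sketch of that mandatory-feature approach (the host name and quantities are hypothetical; the machines-file fields after the speed factor are supported features followed by mandatory features):

    ssh://bigmem x86_64-linux - 8 1 - big-memory:4

Because big-memory is a mandatory feature here, only derivations that list it in requiredSystemFeatures (e.g. ["big-memory:1"]) would be dispatched to this machine, which keeps small builds from crowding it out.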
I'd be interested to see what others think on the question of making it its own field/setting. I'd also be interested in the question of forgoing this in favor of support for direct integration with job schedulers, which could handle the resource management themselves. I have some free time on my hands right now (read: laid off), so I'd be happy to work on this further depending on what others think the direction should be.
If foo:0 == foo, I'd honestly argue to reject foo:0 at parse time... As someone who was just glancing over a configuration, I'd intuit it to mean "I want something that doesn't have foo" instead of "I want something that has unlimited foos"
Agreed, I've added a check for this.
Adding `resource-management` to the `experimental-features` setting in `nix.conf` enables a basic resource management scheme for system features. This is akin to what can be accomplished with job schedulers like Slurm, where a remote machine can have a limited quantity of a resource that can be temporarily "consumed" by a job. This can be used with memory-heavy builds, or derivations that require exclusive access to particular hardware resources.
Resource management is supported in both the supported features and mandatory features of a remote machine configuration, by appending a colon `:` to a feature name followed by the quantity that this machine has. This is tracked on a per-store basis, so different users on a multi-user installation share the same pool of resources for their remote build machines. A derivation specifies that it consumes a resource with the same notation in the `requiredSystemFeatures` attribute.
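Putting the two settings together, a minimal sketch of such a configuration in `nix.conf` might look like this (the host name and quantities are made up; the last two fields of the builder entry are its supported and mandatory features):

    experimental-features = resource-management
    builders = ssh://node x86_64-linux - 16 1 mem:256 gpu:2

Here mem:256 sits in the supported-features field and gpu:2 in the mandatory-features field, so the quantity notation works the same way in both.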
What does "this is tracked on a per-store basis" mean (which store)? The machine-that-wants-to-build-this-on-a-remote-builder's Nix store, or the machine-that-can-actually-build-this's Nix store? How is it tracked?
Good point, clarified this at the end of the doc entry.
c97f3f6 to d1ce962
Adds basic resource tracking to system features used by distributed builds, similar to resource management in job schedulers like Slurm. Includes a positive and negative functional test and a documentation update to the distributed builds section.
Resolves NixOS#2307
Signed-off-by: Lisanna Dettwyler <[email protected]>
d1ce962 to 1b69645
Adds basic resource tracking to system features used by distributed builds, similar to resource management in job schedulers like Slurm.
Includes a positive and negative functional test and a documentation update to the distributed builds section.
Resolves #2307
There is a corresponding implementation for Hydra, but it is old and needs updating and rebasing. Interest has been expressed in having this feature both on the original Hydra PR NixOS/hydra#588 and in the linked Nix issue.
Motivation
Non-trivial derivations may consume various resources on remote machines, like memory or GPUs. Nix has no way to account for this beyond a one-dimensional "slots" measure per machine. This can lead to over- or under-utilization of build machines: either the number of slots is needlessly limited to accommodate the most resource-intensive derivations, or the machine risks having its resources overrun.
Context
#2307
NixOS/hydra#585
File locks are used in build-remote.cc to keep track of resource usage on the dispatching end.
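For readers unfamiliar with the approach, here is a rough sketch of what file-lock based tracking on the dispatching side can look like. This is not the PR's code: it uses plain POSIX calls and a hypothetical state-file path and format, whereas the actual implementation lives in build-remote.cc and may differ in detail.

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>
    #include <fstream>
    #include <string>

    // Sketch: serialize updates to a per-resource usage counter with flock(2),
    // so concurrent dispatch attempts on the local machine agree on how much
    // of a resource is currently consumed. Path and file format are hypothetical.
    static bool tryConsume(const std::string & stateFile,
                           unsigned long want, unsigned long capacity)
    {
        int fd = open(stateFile.c_str(), O_RDWR | O_CREAT, 0600);
        if (fd < 0) return false;
        flock(fd, LOCK_EX);                  // block until we hold the lock

        unsigned long used = 0;
        std::ifstream(stateFile) >> used;    // an empty or missing count reads as 0

        bool ok = used + want <= capacity;   // is there enough of the resource left?
        if (ok)
            std::ofstream(stateFile, std::ios::trunc) << used + want;

        flock(fd, LOCK_UN);
        close(fd);
        return ok;                           // the caller decrements again on release
    }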